class: inverse, center, middle
The emergence of high-throughput sequencing technologies such as 454 (Roche) and Solexa (Illumina) sequencing allowed for the highly parallel short read sequencing of DNA molecules.
Sequencing is typically performed on bulk tissue or cells.
Analysis captures the bulk characteristics of the data without resolving the heterogeneity within it.
Newer technologies such as TRAP from the Heintz lab or nuclei sorting allow for capture of distinct cell types based on expressed markers.
Pros - Allow for the capture of rare cell populations such as specific neuron types.
Cons - Require known markers for desired cell populations
With the advent of advanced microfluidics and refined sequencing technologies, single-cell sequencing has emerged as a technology to profile individual cells from a heterogeneous population without prior knowledge of cell populations.
Pros - No prior knowledge of cell populations required. - Simultaneously assess profiles of 1000s of cells.
Cons - Low sequencing depth for individual cells (1000s vs millions of reads for bulk).
Single-cell sequencing, as with bulk sequencing, has now been applied to the study of a wide range of differing assays.
Many companies offer single-cell sequencing technologies which may be used with the Illumina sequencer.
Two major companies offer the most widely used technologies.
The major differences between the two are the sequencing depth and the coverage profiles across transcripts.
Read 1
Read 2
The sequence reads contain:-
As with standard bulk sequencing data, the next steps are typically to align the data to a reference genome/transcriptome and summarise data to a signal matrix.
For the processing of scRNA/snRNA from fastQ to count matrix, there are many options available to us.
Alignment and counting - Cellranger count - STAR - STARsolo - Subread cellCounts
Pseudoalignment and counting - Salmon - Alevin - Kallisto - Bustools
The output of these tools is typically a matrix of the signal attributed to cells and genes (typically read counts).
This matrix is the input for all downstream post-processing, quality control, normalisation, batch correction, clustering, dimension reduction and differential expression analysis.
The output matrix is often stored in a compressed format such as:- - MEX (Market Exchange Format) - HDF5 (Hierarchical Data Format)
]
]
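To make the MEX layout concrete, here is a minimal sketch that writes a toy `matrix.mtx` (3 genes by 2 barcodes) and sums the counts per barcode with awk. The values and the temporary directory are invented for illustration; real Cell Ranger output is gzipped and sits alongside `barcodes.tsv.gz` and `features.tsv.gz`.

```shell
# Toy MEX matrix: rows are genes, columns are barcodes (all values invented).
mex_dir=$(mktemp -d)
cat > "$mex_dir/matrix.mtx" <<'EOF'
%%MatrixMarket matrix coordinate integer general
3 2 4
1 1 5
2 1 1
1 2 2
3 2 7
EOF

# Skip '%' header lines and the dimension line (genes, barcodes, non-zero entries),
# then sum the count (column 3) for each barcode index (column 2).
per_barcode=$(awk '!/^%/ { n++; if (n > 1) total[$2] += $3 }
                   END { for (b in total) print b, total[b] }' "$mex_dir/matrix.mtx" | sort -n)
echo "$per_barcode"
```

Each data line is a (gene index, barcode index, count) triple, so only non-zero entries are stored.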
class: inverse, center, middle
Cell Ranger is a suite of tools for single cell processing and analysis available from 10x Genomics.
In this session we will make use of the Cell Ranger count tool to process our generated fastQ files and create the required files. ]
]
Cell Ranger is available from the 10x Genomics website.
Also available are pre-built references for the Human and Mouse genomes (GRCh37/38 and GRCm38).
] ]
Download the software
wget -O cellranger-7.1.0.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-7.1.0.tar.gz?Expires=1686030213&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZi4xMHhnZW5vbWljcy5jb20vcmVsZWFzZXMvY2VsbC1leHAvY2VsbHJhbmdlci03LjEuMC50YXIuZ3oiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE2ODYwMzAyMTN9fX1dfQ__&Signature=jlyPxMYCCjFJZzrVMPcb7GSVZSYCCbyfnTQO2PlnCszG-ycgpHgUllHuV6l0ke7p5fPgM~m8xiQPiq-5VqDwdXKuGuKXtWMFFtakYnSroj7O79gQf2lKEXpRQlfeou5EEP4KQBjquwfbHZWs-NNFKyGMYjYrt6qxoyMzrcrgl2rEvRO7Pu8vwk0DJFnwRRRu~wFJEaDqUJ4vFXuKw1jT9aus~bzLeF4fsWDVQYfA7H71yc5zBvKxz1tfD-zTm7ARaA6j-gyC3ffQf9K5W7HSMJD8Iqez39-B8SHMwzsBv0o~uPlatSv-1YataeSHQQykRWxjZdrMg-5IL2neGrM8zA__&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA"
Download reference for Human genome (GRCh38)
wget -O refdata-gex-GRCh38-2020-A.tar.gz "https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz"
Unpack software and references
tar -xzvf cellranger-7.1.0.tar.gz
tar -xzvf refdata-gex-GRCh38-2020-A.tar.gz
export PATH=/PATH_TO_CELLRANGER_DIRECTORY/cellranger-7.1.0:$PATH
Now that we have downloaded the Cell Ranger software and the required pre-built reference for Human (GRCh38), we can start generating count data from scRNA-seq/snRNA-seq fastQ data.
Typically the fastQ files for your scRNA run will have been generated using the Cell Ranger mkfastq toolset to produce a directory of fastQ files.
We can now use the Cell Ranger count command with our reference and fastQ files to generate our count matrix and associated files.
cellranger count --id=my_run_name \
--fastqs=PATH_TO_FASTQ_DIRECTORY \
--transcriptome=/PATH_TO_CELLRANGER_DIRECTORY/refdata-gex-GRCh38-2020-A
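Before running the count step it can help to confirm that the fastQ directory follows the `SAMPLE_S1_L001_R1_001.fastq.gz` naming convention. The sketch below recovers the sample names from such a directory; the `pbmc` file names and the temporary directory are placeholders invented for this example.

```shell
# Hypothetical fastQ directory populated with empty placeholder files.
fastq_dir=$(mktemp -d)
touch "$fastq_dir/pbmc_S1_L001_R1_001.fastq.gz" \
      "$fastq_dir/pbmc_S1_L001_R2_001.fastq.gz" \
      "$fastq_dir/notes.txt"

# Strip the _S<n>_L<lane>_<read>_001.fastq.gz suffix to recover sample names;
# files that do not match the convention (notes.txt) are ignored.
samples=$(ls "$fastq_dir" \
  | sed -n -E 's/_S[0-9]+_L[0-9]{3}_[RI][12]_001\.fastq\.gz$//p' \
  | sort -u)
echo "$samples"
```

If the directory holds more than one sample, a recovered name can be passed to cellranger count with the `--sample` flag to select it.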
If you are working with a genome which is neither Human nor Mouse, you will need to find another source for your Cell Ranger reference.
To create your own references you will need two additional files.
Used for genome annotation.
Stores position, feature (exon) and meta-feature (transcript/gene) information.
Importantly for Cell Ranger count, only features labelled as exon (column 3) will be considered when counting signal in genes.
Many genome annotations label mitochondrial genes with CDS rather than exon, so these entries must be updated before building a reference.
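One way to make that update is to relabel the feature type with awk. This is a minimal sketch on a toy GTF and rests on two assumptions: the mitochondrial sequence is named `MT`, and its genes have no exon records already (otherwise the relabelled CDS lines would duplicate them).

```shell
gtf_dir=$(mktemp -d)
# Toy tab-separated GTF with invented coordinates; "MT" as the mitochondrial name is an assumption.
printf 'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'  > "$gtf_dir/in.gtf"
printf 'MT\tsrc\tCDS\t10\t90\t.\t+\t0\tgene_id "mt1"; transcript_id "mt1.t";\n' >> "$gtf_dir/in.gtf"

# Relabel CDS records on MT as exon; comment lines and all other records pass through unchanged.
awk 'BEGIN { FS = OFS = "\t" }
     !/^#/ && $1 == "MT" && $3 == "CDS" { $3 = "exon" }
     { print }' "$gtf_dir/in.gtf" > "$gtf_dir/fixed.gtf"

# Count the MT exon records now available for counting.
n_mt_exon=$(awk -F '\t' '$1 == "MT" && $3 == "exon"' "$gtf_dir/fixed.gtf" | wc -l | tr -d ' ')
echo "MT exon records: $n_mt_exon"
```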
Now that we have the gene models in GTF format, we can use the Cell Ranger mkgtf tool to validate our GTF and remove any unwanted annotation types using the --attribute flag.
Below is an example of how 10x generated the GTF for the Human reference.
cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf \
Homo_sapiens.GRCh38.ensembl.filtered.gtf \
--attribute=gene_biotype:protein_coding \
--attribute=gene_biotype:lncRNA \
--attribute=gene_biotype:antisense \
--attribute=gene_biotype:IG_LV_gene \
--attribute=gene_biotype:IG_V_gene \
--attribute=gene_biotype:IG_V_pseudogene \
--attribute=gene_biotype:IG_D_gene \
--attribute=gene_biotype:IG_J_gene \
--attribute=gene_biotype:IG_J_pseudogene \
--attribute=gene_biotype:IG_C_gene \
--attribute=gene_biotype:IG_C_pseudogene \
--attribute=gene_biotype:TR_V_gene \
--attribute=gene_biotype:TR_V_pseudogene \
--attribute=gene_biotype:TR_D_gene \
--attribute=gene_biotype:TR_J_gene \
--attribute=gene_biotype:TR_J_pseudogene \
--attribute=gene_biotype:TR_C_gene
Following filtering of your GTF to the required biotypes, we can use the Cell Ranger mkref tool to finally create our custom reference.
cellranger mkref --genome=custom_reference \
--fasta=custom_reference.fa \
--genes=custom_reference_filtered.gtf
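A common stumbling block with custom references is a mismatch between sequence names in the FASTA and in the GTF (for example `chrM` versus `MT`). The sketch below compares the two on toy files; the file contents are invented, and only the file names follow the mkref example above.

```shell
export LC_ALL=C   # consistent sort order for comm
ref_dir=$(mktemp -d)
printf '>chr1 toy contig\nACGT\n>chrM\nACGT\n' > "$ref_dir/custom_reference.fa"
printf 'chr1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1";\nMT\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2";\n' \
  > "$ref_dir/custom_reference_filtered.gtf"

# Sequence names in the FASTA: text after '>' up to the first whitespace.
grep '^>' "$ref_dir/custom_reference.fa" \
  | sed -E 's/^>([^[:space:]]+).*/\1/' | sort -u > "$ref_dir/fa_names.txt"
# Sequence names used in column 1 of the GTF.
grep -v '^#' "$ref_dir/custom_reference_filtered.gtf" \
  | cut -f 1 | sort -u > "$ref_dir/gtf_names.txt"

# Names present in the GTF but absent from the FASTA.
missing=$(comm -13 "$ref_dir/fa_names.txt" "$ref_dir/gtf_names.txt")
echo "GTF contigs missing from FASTA: ${missing:-none}"
```

Any name reported here would need to be reconciled in one of the two files before building the reference.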
class: inverse, center, middle
Having completed the Cell Ranger count step, the user will have created a folder with the name set by the --id flag of the count command.
Within this folder will be the outs/ directory containing all the outputs generated from Cell Ranger count.
The count matrices to be used for further analysis are stored in both MEX and HDF5 formats within the output directories.
The filtered matrix only contains detected, cell-associated barcodes whereas the raw contains all barcodes (background and cell-associated).
MEX format - filtered_feature_bc_matrix - raw_feature_bc_matrix
HDF5 format - filtered_feature_bc_matrix.h5 - raw_feature_bc_matrix.h5
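As a quick check of the difference between the two matrices, the number of barcodes in each MEX directory can be compared. The sketch below mimics the directory layout with toy `barcodes.tsv.gz` files; the barcode strings and the temporary `outs` directory are invented.

```shell
# Mock outs/ directory with toy barcode lists (real files hold full-length barcodes).
outs=$(mktemp -d)
mkdir -p "$outs/raw_feature_bc_matrix" "$outs/filtered_feature_bc_matrix"
printf 'AAAC-1\nAAAG-1\nAAAT-1\nAACA-1\n' | gzip > "$outs/raw_feature_bc_matrix/barcodes.tsv.gz"
printf 'AAAC-1\nAAAT-1\n' | gzip > "$outs/filtered_feature_bc_matrix/barcodes.tsv.gz"

# One barcode per line, so the line count is the number of barcodes in each matrix.
n_raw=$(gzip -dc "$outs/raw_feature_bc_matrix/barcodes.tsv.gz" | wc -l | tr -d ' ')
n_filtered=$(gzip -dc "$outs/filtered_feature_bc_matrix/barcodes.tsv.gz" | wc -l | tr -d ' ')
echo "raw barcodes: $n_raw, cell-associated barcodes: $n_filtered"
```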
The outs directory also contains a BAM file of alignments for all barcodes against the reference (possorted_genome_bam.bam) as well as an associated BAI index file (possorted_genome_bam.bam.bai).
This BAM file is often used in downstream analysis such as scSplit/Velocyto as well as for the generation of signal graphs such as bigWigs. ]
]
Cell Ranger also outputs a file for visualisation within its own Loupe browser - cloupe.cloupe.
This allows for the visualisation of scRNA-seq/snRNA-seq data as a t-SNE/UMAP, with the ability to overlay QC metrics and gene expression onto the cells in real time.
Cell Ranger will also output summaries of useful metrics as a text file (metrics_summary.csv) and as an intuitive web page.
Metrics include
There are many potential issues which can arise in scRNA-seq/snRNA-seq data including -
]
Assessment of the overall quality of a scRNA-seq/snRNA-seq experiment and filtering of low quality or contaminated cell counts is an essential step in analysis.
class: inverse, center, middle
The web summary html file contains an interactive report describing the most essential QC for your single cell experiment as well as initial clustering and dimension reduction for your data.
The web summary also contains useful information on the input files and the versions used in this analysis for later reproducibility. ]
]
The first thing we can review is the Sample information panel.
This contains information on:-
]
]
The Sequencing panel highlights information on the quality of the Illumina sequencing.
This contains information on:-
]
]
The Mapping panel highlights information on the mapping of reads to the reference genome and transcriptome.
This contains information on:-
]
]
The Cells panel highlights some of the most important information in the report, the total number of cells captured and the distribution of counts across cells and genes.
Information includes:-
]
]
The Cells panel also includes an interactive knee plot.
The knee plot shows:-
On the x-axis, the barcodes ordered by the most frequent on the left to the least frequent on the right
On the y-axis, the frequency of each ordered barcode.
Highlighted in blue are the barcodes marked as associated to cells.
]
]
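The two axes of the knee plot can be reproduced from a table of total UMI counts per barcode. This is a minimal sketch on invented values; the `umi_per_barcode.tsv` file and its barcode names are hypothetical and are not a Cell Ranger output.

```shell
umi_dir=$(mktemp -d)
# Toy table: barcode <tab> total UMI count (values invented).
printf 'BC_A\t12\nBC_B\t4500\nBC_C\t3\nBC_D\t3900\n' > "$umi_dir/umi_per_barcode.tsv"

# Sort by UMI count, highest first, then prepend the rank.
# Output columns: rank (x-axis), UMI count (y-axis), barcode.
knee=$(sort -t "$(printf '\t')" -k2,2nr "$umi_dir/umi_per_barcode.tsv" \
  | awk 'BEGIN { FS = OFS = "\t" } { print NR, $2, $1 }')
echo "$knee"
```

Plotting rank against UMI count on log-log axes gives the cliff-and-knee shape, with the sharp drop separating cell-associated barcodes from background.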
It is apparent that barcodes labelled blue (cell-associated barcodes) do not have a cut-off based on UMI count.
In the latest version of Cell Ranger, a two-step process based on the EmptyDrops method (Lun et al., 2019) is used to define cell-associated barcodes.
If required, the --force-cells flag can be used with cellranger count to force a set number of cell-associated barcodes.
]
]
The knee plot also acts as a good QC tool to investigate differing types of single-cell failure.
Whereas our previous knee plot represented a good sample, differing knee plot patterns can be indicative of specific problems with the single cell protocol.
In this example we see no specific cliff and knee suggesting a failure in the integration of oil, beads and samples (wetting failure) or a compromised sample.
]
]
If there is a clog in the machine we may see a knee plot where the overall number of barcodes is low.
]
]
There may be occasions where we see two sets of cliff-and-knees in our knee plot.
This could be indicative of a heterogeneous sample where we have two populations of cells with differing overall RNA levels.
Knee plots should be interpreted in the context of the biology under investigation.
]
]
The web summary also contains an analysis page where default dimension reduction, clustering and differential expression between clusters have been performed.
Additionally, the analysis page contains information on sequencing saturation and genes per cell vs reads per cell.
]
]
The t-SNE plot shows the distribution of cells and the similarity between them within your data.
]
]
The sequencing saturation and median genes per cell plots show these calculations (as shown on the summary page) over successive downsampling of the data.
By reviewing the curve of the downsampled metrics we can assess whether we are approaching saturation for either of these metrics.
]
]
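For intuition, sequencing saturation can be computed by hand. The sketch below assumes the definition given in the 10x documentation, 1 - (deduplicated reads / total reads), where total reads are those with a valid barcode and UMI that map confidently to a gene and deduplicated reads are the distinct barcode/UMI/gene combinations among them. The numbers are invented.

```shell
# Toy numbers (invented): confidently mapped reads with a valid barcode and UMI,
# and the number of distinct barcode/UMI/gene combinations among them.
n_reads=1000
n_deduped_reads=400

# Saturation is the fraction of reads that were duplicates of an already observed molecule.
saturation=$(awk -v r="$n_reads" -v d="$n_deduped_reads" 'BEGIN { printf "%.2f", 1 - d / r }')
echo "sequencing saturation: $saturation"
```

A value close to 1 suggests that further sequencing would mostly re-read molecules that have already been observed.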
class: inverse, center, middle
The Loupe browser is a tool for visualization of Cell Ranger cloupe files.
It provides t-SNE visualization of your single-cell data alongside sample/cell information as well as methods to test and visualise changes in gene expression.
Loupe browser can be freely downloaded from the 10x website.
]
]
Having downloaded the Loupe browser we can load our cloupe files directly in and rapidly visually interrogate our data.
In today's session we will review some of the features available in Loupe using the PBMC example data set.
]
]